Introduction

To test the accuracy of the gold-standard data, we can take a wisdom-of-the-crowd approach: compare the set of predicted values for a given measurement (e.g. all predicted values for Patient 001, Joint 001, erosion) to the gold-standard measurement. We calculate z-scores for the population formed by the predictions plus the gold-standard value. If the z-score of the gold standard is considerably higher or lower than the rest of the population (i.e. very far from 0), the gold-standard measurement may be incorrect.
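As a toy sketch of the idea (all numbers here are hypothetical; the real analysis below uses the challenge submissions pulled from Synapse):

```r
# Toy illustration of flagging a suspect gold-standard value:
# ten crowd predictions for one measurement, plus a gold-standard
# score that may be a scoring error.
predictions <- c(2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.2)
gold <- 4.5  # far from the crowd consensus

scores  <- c(predictions, gold)
zscores <- (scores - mean(scores)) / sd(scores)

# The gold standard is the last element; a |z| well above ~2 here
# suggests the measurement deserves manual review.
gold_z <- tail(zscores, 1)
```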

First, load packages and download a View of all submissions. We consider only the final submission from each team, to avoid weighting the population towards teams that submitted multiple similar predictions.

library(tidyverse)
library(reticulate)
library(reactable)

# Use the Python synapseclient via reticulate to query Synapse
use_condaenv("ra2dream", required = TRUE)
synapseclient <- reticulate::import('synapseclient')
syn <- synapseclient$Synapse()
syn$login()  # assumes cached Synapse credentials

Then, calculate the z-scores for each prediction and the gold-standard value, grouped by Patient_ID and measurement (i.e. z-scores are calculated within each group, not across all groups).

## Filtered out sha256:6cb365... because this was the wrong baseline model; we re-ran with the correct baseline (sha256:2d72ae6...)

tab <- syn$tableQuery("select * from syn22236264 WHERE status = 'ACCEPTED' AND
    submitterid <> 3408914 AND
    dockerdigest not in ('sha256:6cb365830745c5d368bcd128a3a1b750ace801da281356a5a623237408c10a3d')")$asDataFrame() %>%
  group_by(submitterid) %>%
  slice(which.max(createdOn)) %>%  # keep only the most recent submission per team
  select(-createdOn) %>%
  ungroup()

submission <- lapply(tab$prediction_fileid, function(x){
  syn$get(x)$path %>%
    readr::read_csv() %>%
    tidyr::gather(measurement, score, -Patient_ID) %>%
    mutate(prediction = x)  # tag each row with its source prediction file ID
}) %>% bind_rows()

gold <- syn$get("syn22254942")$path %>%
  readr::read_csv() %>%
  tidyr::gather(measurement, score, -Patient_ID) %>%
  mutate(prediction = 'gold')

submission_zscores <- bind_rows(submission, gold) %>%
  group_by(Patient_ID, measurement) %>%
  mutate(zscore = signif((score - mean(score)) / sd(score), 3)) %>%
  ungroup()
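As a quick base-R check (on toy data) that grouping scopes the scaling: each group is standardized by its own mean and standard deviation, so two patients whose raw scores differ tenfold in magnitude get identical within-group z-scores, which a single global scaling would not produce.

```r
# Toy data: two patients whose scores differ 10x in magnitude.
score      <- c(1, 2, 3, 10, 20, 30)
Patient_ID <- c("P1", "P1", "P1", "P2", "P2", "P2")

# Within-group z-scores (what group_by() + mutate() computes above)
z_within <- ave(score, Patient_ID, FUN = function(s) (s - mean(s)) / sd(s))

# Global z-scores, for comparison: P2's values dominate the scale.
z_global <- (score - mean(score)) / sd(score)

z_within  # each patient scaled by its own mean/sd
```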

SC1

Plot the SC1 gold standard z-scores. The z-scores for these Overall_Tol values are not too high; I would surmise this is because these values are the sum of a large number of individual joint measurements, and are therefore "buffered" against large changes caused by single-joint scoring errors.
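A rough simulation of that intuition (the joint count and score distribution are made up for illustration): a sizable error in one joint barely moves the z-score of a total built from ~40 joints, because the total's spread across the crowd is much larger than any single-joint error.

```r
set.seed(1)
n_joints <- 40  # illustrative joint count, not the actual SvH joint set

# A "crowd" of plausible overall totals, each the sum of n_joints scores
totals <- replicate(1000, sum(rpois(n_joints, lambda = 2)))

true_total  <- sum(rpois(n_joints, lambda = 2))
wrong_total <- true_total + 3  # one joint mis-scored by +3

z <- function(x, pop) (x - mean(pop)) / sd(pop)
shift <- abs(z(wrong_total, totals) - z(true_total, totals))
shift  # well under 1: the total "buffers" the single-joint error
```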

The table below shows the same data.

p <- ggplot(submission_zscores %>%
              filter(prediction == "gold",
                     measurement == "Overall_Tol")) +
  geom_point(aes(x = score, y = zscore, text = Patient_ID, label = measurement)) +
  theme_bw() +
  ggtitle("SC1 Gold Standard Z-scores")
  
plotly::ggplotly(p)
submission_zscores %>%
  filter(prediction == "gold",
         measurement == "Overall_Tol") %>%
  select(-prediction) %>%
  arrange(desc(zscore)) %>%
  reactable(sortable = TRUE, filterable = FALSE, bordered = TRUE, compact = TRUE,
            style = list(fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, Arial, sans-serif"))

SC2

Plot the SC2 gold standard z-scores. This plot is not interactive because there are too many data points. The table below shows the same data for the SC2 joint-narrowing gold standard scores and is sortable.
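A quick check of the pattern used to pick out the narrowing measurements (the column names here are invented to match the `_J__` / `_E__` naming convention):

```r
# Narrowing columns carry "_J__"; erosion columns carry "_E__".
cols <- c("LH_mcp_J__1", "RF_mtp_E__5", "Overall_Tol")

is_narrowing <- grepl(".+_J__.+", cols)
is_narrowing  # TRUE FALSE FALSE
```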

p <- ggplot(submission_zscores %>%
              filter(prediction == "gold") %>%
              filter(grepl('.+_J__.+', measurement))) +
  geom_jitter(aes(x = score, y = zscore, text = Patient_ID, label = measurement)) +
  theme_bw() +
  ggtitle("SC2 (narrowing) Gold Standard Z-scores (x-axis jittered for visualization)")

p

submission_zscores %>%
  filter(prediction == "gold") %>%
  filter(grepl('.+_J__.+', measurement)) %>%
  select(-prediction) %>%
  arrange(desc(zscore)) %>%
  reactable(sortable = TRUE, filterable = FALSE, bordered = TRUE, compact = TRUE,
            style = list(fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, Arial, sans-serif"))

SC3

Plot the SC3 gold standard z-scores. This plot is not interactive because there are too many data points. The table below shows the same data for the SC3 joint-erosion gold standard scores and is sortable.

p <- ggplot(submission_zscores %>%
              filter(prediction == "gold") %>%
              filter(grepl('.+_E__.+', measurement))) +
  geom_jitter(aes(x = score, y = zscore, text = Patient_ID, label = measurement)) +
  theme_bw() +
  ggtitle("SC3 (erosion) Gold Standard Z-scores (x-axis jittered for visualization)")

p

submission_zscores %>%
  filter(prediction == "gold") %>%
  filter(grepl('.+_E__.+', measurement)) %>%
  select(-prediction) %>%
  arrange(desc(zscore)) %>%
  reactable(sortable = TRUE, filterable = FALSE, bordered = TRUE, compact = TRUE,
            style = list(fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, Arial, sans-serif"))
write_csv(submission_zscores %>%
            filter(prediction == "gold") %>%
            select(-prediction) %>%
            arrange(desc(zscore)),
          'gold_standard_z_scores.csv')